Addition Similarity


In [1]:
# Import libraries
import numpy as np
import pandas as pd
# Import the data
import WTBLoad
wtb = WTBLoad.load()

Question: How similar are two additions? For instance, I'm thinking of brewing a beer with plums and vanilla, and I want to know whether those two tend to work in the same styles.

How to get there: The dataset shows the percentage of votes that said a style-addition combo would likely taste good. So, we can compare the votes on each style for the two additions, and see how similar they are.
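
To make the comparison concrete, here is a minimal sketch on made-up numbers. The style names and vote shares below are hypothetical and are not taken from the dataset; only the idea (square the per-style gap, then average) follows the approach described above.

```python
import numpy as np
import pandas as pd

# Made-up vote shares for illustration only (not from the real dataset).
styles = ["stout", "saison", "pilsner"]
plum = pd.Series([0.8, 0.4, 0.2], index=styles)
vanilla = pd.Series([0.6, 0.9, 0.3], index=styles)

# Per-style gap, squared so that positive and negative gaps both count.
squared_gap = np.square(plum - vanilla)

# Averaging over styles gives one number per pair; lower means more alike.
score = squared_gap.mean()
print(squared_gap.round(2).to_dict())
print(round(score, 4))
```

With these toy values the squared gaps are roughly 0.04, 0.25 and 0.01, so the score comes out at about 0.1. The large gap on a single style dominates the result, which is the usual effect of squaring.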


In [2]:
# Square the difference in each row (one row per style), then return the mean of the column.
# This is the mean squared difference between the two additions' vote percentages.
# Despite the name, it behaves like a distance: higher if they are different, lower if they are similar.
def similarity(additionA, additionB):
    diff = np.square(wtb[additionA] - wtb[additionB])
    return diff.mean()

res = []
# Loop through each addition pair
for additionA in wtb.columns:
    for additionB in wtb.columns:
        # Keep each unordered pair once: the strict < skips the case where A and B
        # are the same addition, and also skips the mirrored (B, A) duplicate.
        if additionA < additionB:
            res.append([additionA, additionB, similarity(additionA, additionB)])
df = pd.DataFrame(res, columns=["additionA", "additionB", "similarity"])
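
As an aside, the same "each pair once" enumeration can be written with `itertools.combinations` from the standard library. This is only a sketch of an alternative, not the code the notebook ran; the `toy` frame and the `pairwise_scores` helper are hypothetical names, and the numbers are made up.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical stand-in for wtb: rows are styles, columns are additions.
toy = pd.DataFrame(
    {
        "plum": [0.8, 0.4, 0.2],
        "vanilla": [0.6, 0.9, 0.3],
        "oak": [0.1, 0.7, 0.5],
    },
    index=["stout", "saison", "pilsner"],
)

def pairwise_scores(frame):
    """Return one row per unordered pair of columns with its mean squared difference."""
    rows = []
    # Sorting first keeps additionA alphabetically before additionB,
    # matching the layout produced by the nested loops.
    for a, b in combinations(sorted(frame.columns), 2):
        score = np.square(frame[a] - frame[b]).mean()
        rows.append([a, b, score])
    return pd.DataFrame(rows, columns=["additionA", "additionB", "similarity"])

toy_pairs = pairwise_scores(toy)
print(toy_pairs)
```

For n additions this yields n(n-1)/2 rows, which is consistent with the 1485 pairs reported by `df.describe()` if the dataset has 55 additions.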

Top 10 most similar additions


In [3]:
df.sort_values("similarity").head(10)


Out[3]:
     additionA   additionB        similarity
530  chamomile   rose hips        0.011956
403  bourbon     whiskey          0.013294
962  grapefruit  lemon grass      0.013347
928  ginger      juniper berries  0.013454
514  chamomile   lemon pepper     0.013545
88   apple       pear             0.013556
297  blackberry  raspberry        0.013563
501  chamomile   coriander        0.014286
265  blackberry  cherry           0.014319
529  chamomile   rhubarb          0.014383

Top 10 least similar additions


In [4]:
df.sort_values("similarity", ascending=False).head(10)


Out[4]:
      additionA    additionB    similarity
1432  red wine     rye          0.159639
1243  oak          red wine     0.152291
1238  oak          piña colada  0.146401
1264  orange peel  red wine     0.145567
876   cucumber     port         0.143464
246   basil        port         0.142268
1372  piña colada  rye          0.139274
1366  piña colada  port         0.137152
1405  port         watermelon   0.132705
1434  red wine     smoke        0.129498

Similarity of a specific combo


In [5]:
def comboSimilarity(additionA, additionB):
    # df stores each pair once, with additionA before additionB alphabetically,
    # so swap the arguments if they arrive in the other order
    if additionA > additionB:
        additionA, additionB = additionB, additionA
    # Combine both conditions into a single boolean mask
    return df[(df['additionA'] == additionA) & (df['additionB'] == additionB)]
comboSimilarity('plum', 'vanilla')


Out[5]:
additionA additionB similarity
1391 plum vanilla 0.050466

But is that good or bad? How does it compare to others?


In [6]:
df.describe()


Out[6]:
similarity
count 1485.000000
mean 0.043910
std 0.025011
min 0.011956
25% 0.025427
50% 0.037313
75% 0.053579
max 0.159639

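The plum and vanilla score (0.050466) is above the mean (0.043910) and much closer to the 75th percentile (0.053579) than to the median (0.037313). Because a higher score means the two additions were rated less alike, plum and vanilla are less similar than most pairs. So we can conclude it's not likely a combo that will be great together, as the two additions don't tend to do well in the same beers.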
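
To put a single number on "how does it compare to others", one option is the share of pairs with a lower (more similar) score. The sketch below uses a made-up list of scores and a hypothetical helper name, `share_of_pairs_more_similar`. In the notebook the inputs would presumably be `df["similarity"]` and the plum and vanilla value, but that has not been run here.

```python
import pandas as pd

def share_of_pairs_more_similar(scores, value):
    """Fraction of scores strictly below `value` (lower score = more similar)."""
    # Comparing gives a boolean Series; its mean is the fraction of True values.
    return float((scores < value).mean())

# Made-up scores for illustration only (not the real 1485 pairs).
toy_scores = pd.Series([0.01, 0.02, 0.03, 0.05, 0.08])

share = share_of_pairs_more_similar(toy_scores, 0.05)
print(share)
```

A share near 0 would mean the pair is among the most similar in the table, and a share near 1 would mean it is among the least similar.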